Hi, I’m Andreas! 

• One of the Engine leads at Insomniac Games 

• Heading up the fearless “FedEx” Core group 

• Our focus: Engine runtime + infrastructure 

• But also involved in gameplay optimization 

• We're all about getting the job done 
Talk Outline 

• SSE Intro 

• Gameplay Example 

• Techniques (tricks!) 

• Best Practices 

• Resources + Q & A 
SIMD at Insomniac Games 

• Long history of SIMD programming in the studio 

• PS2 VU, PS3 SPU+Altivec, X360 VMX128, SSE(+AVX) 

• Focus on SSE programming for this cycle 

• Even bigger incentive when PCs+consoles share ISA 

• PC workstations are ridiculously fast when you use SIMD 

• Lots of old best practices don’t apply to SSE 
Why CPU SIMD? 

• Isn't everything GPGPU nowadays? 

• Definitely not! 

• Don't want to waste the x86 cores on current consoles 

• Many problems are too small to move to GPU 


• Never underestimate brute force + linear access 

• CPU SIMD can give you massive performance boosts 

• Don't want to leave performance on the table 
SIMD 

[Figure: Input Data] 

... It's just like dicing veggies! 
Options for SSE and AVX SIMD 

• Compiler auto-vectorization 

• Intel ISPC 

• Intrinsics 

• Assembly 



GAME DEVELOPERS CONFERENCE® 2015 


MARCH 2-6, 2015 GDCONF.COM 


Compiler Auto-vectorization 

• Utopian idea, doesn’t work well in practice 

• Compilers are tools, not magic wands 

• Often breaks during maintenance 

• Left with a fraction of the performance! 

• Compiler support/guarantees = terrible 

• No support in VS2012, some in VS2013 

• Different compilers have different quirks 





Auto-vectorization 


float Foo(const float input[], int n) 
{ 
  float acc = 0.f; 
  for (int i = 0; i < n; ++i) { 
    acc += input[i]; 
  } 
  return acc; 
} 

GCC 4.8: Vectorized 
Clang 3.5: Vectorized 
(Only with -ffast-math) 
Let’s make some changes.. 

float Foo(const float input[], int n) 
{ 
  float acc = 0.f; 
  for (int i = 0; i < n; ++i) { 
    float f = input[i]; 
    if (f < 10.f) 
      acc += f; 
  } 
  return acc; 
} 

GCC 4.8: Vectorized 
Clang 3.5: Vectorized 
(Both branch free) 
Auto-vectorization gone wrong 

float Foo(const float input[], int n) 
{ 
  float acc = 0.f; 
  for (int i = 0; i < n; ++i) { 
    float f = input[i]; 
    if (f < 10.f) 
      acc += f; 
    else 
      acc -= f; 
  } 
  return acc; 
} 

GCC 4.8: scalar + branchy 
Clang 3.5: scalar + branchy 
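
For comparison, here is a hedged sketch of the same add-or-subtract loop written by hand with SSE intrinsics, so the vectorization does not depend on the compiler's mood. `FooIntrinsics` is a hypothetical name, `n` is assumed to be a multiple of 4, and the four partial sums are added in a different order than the scalar loop, so results can differ in the last bits for general float data.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Hypothetical helper (not from the talk): adds input[i] when input[i] < 10,
// subtracts it otherwise. Assumes n is a multiple of 4.
float FooIntrinsics(const float input[], int n)
{
  const __m128 limit    = _mm_set1_ps(10.f);
  const __m128 sign_bit = _mm_set1_ps(-0.f);     // only the sign bit set per lane
  __m128 acc = _mm_setzero_ps();

  for (int i = 0; i < n; i += 4) {
    __m128 f    = _mm_loadu_ps(input + i);       // unaligned load of 4 floats
    __m128 lt   = _mm_cmplt_ps(f, limit);        // all ones where f < 10
    __m128 flip = _mm_andnot_ps(lt, sign_bit);   // sign bit where f >= 10
    acc = _mm_add_ps(acc, _mm_xor_ps(f, flip));  // adds +f or -f, no branch
  }

  // Horizontal add of the four partial sums.
  float lanes[4];
  _mm_storeu_ps(lanes, acc);
  return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

The compare produces a per-lane mask, and the mask selects whether the sign bit gets flipped; the branch turns into plain data flow.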
ISPC 

• Shader-like compiler for SSE/AVX 

• Write scalar code, ISPC generates SIMD code 

• Requires investment in another level of abstraction 

• Caution: easy to generate inefficient load/store code 

• Main benefit: automatic SSE/AVX switching 

• Example: Intel's BC7 texture compressor 

• Automatically runs faster on AVX workstations 
Intrinsics 

• Taking control without dropping to assembly 

• Preferred way to write SIMD at Insomniac Games 

• Predictable: no invisible performance regressions 

• Flexible: exposes all CPU features 

• Hard to learn & get going.. 

• Not a real argument against 

• All good programming is hard (and bad programming easy) 
Assembly 

• Always an option! 

• No inline assembly on 64-bit VS compilers 

• Need external assembler (e.g. yasm) 

• Numerous pitfalls for the beginner 

• Tricky to maintain ABI portability between OSs 

• Non-volatile registers 

• x64 Windows exception handling 

• Stack alignment, debugging, ... 
Why isn’t SSE used more? 

• Fear of fragmentation in PC space 

• SSE2 supported on every x64 CPU, but usually much more 

• “It doesn’t fit our data layout” 

• PC engines traditionally OO-heavy 

• Awkward to try to bolt SIMD code on an OO design 

• “We’ve tried it and it didn’t help” 
“We’ve tried it and it didn’t help” 

• Common translation: “We wrote class Vec4...” 

class Vec4 { 
  __m128 data; 

  operator+ (...) 
  operator- (...) 
}; 
class Vec4 

__m128 has X/Y/Z/W. 

So clearly it’s a 4D vector. 

class Vec4, later that day 

Addition and multiply is going great! 

This will be really fast! 

class Vec4, that night 

Oh no! 

Dot products and other common operations 

are really awkward and slow! 
The Awkward Vec4 Dot Product 

??? Vec4Dot(Vec4 a, Vec4 b) 
{ 
  __m128 a0 = _mm_mul_ps(a.data, b.data); 

  // Wait, how are we going to add the products together? 

  return ???; // And what do we return? 
} 
The Awkward Vec4 Dot Product 

Vec4 Vec4Dot(Vec4 a, Vec4 b) 
{ 
  __m128 a0 = _mm_mul_ps(a.data, b.data); 
  __m128 a1 = _mm_shuffle_ps(a0, a0, _MM_SHUFFLE(2, 3, 0, 1)); 
  __m128 a2 = _mm_add_ps(a1, a0); 
  __m128 a3 = _mm_shuffle_ps(a2, a2, _MM_SHUFFLE(0, 1, 3, 2)); 
  __m128 dot = _mm_add_ps(a3, a2); 
  return dot; // WAT: the same dot product in all four lanes 
} 
class Vec4, the next morning 



Well, at least we’ve tried it. 
I guess SSE sucks. 


class Vec4 


• Wrong conclusion: SSE sucks because Vec4 is slow 


• Correct conclusion: The whole Vec4 idea sucks 
Non-sucky SSE Dot Products (plural) 

__m128 dx = _mm_mul_ps(ax, bx);     // dx = ax * bx 
__m128 dy = _mm_mul_ps(ay, by);     // dy = ay * by 
__m128 dz = _mm_mul_ps(az, bz);     // dz = az * bz 
__m128 dw = _mm_mul_ps(aw, bw);     // dw = aw * bw 

__m128 a0 = _mm_add_ps(dx, dy);     // a0 = dx + dy 
__m128 a1 = _mm_add_ps(dz, dw);     // a1 = dz + dw 
__m128 dots = _mm_add_ps(a0, a1);   // dots = a0 + a1 
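
To make the fragment above concrete, here is a small self-contained sketch that computes four 4D dot products in one go. The `Vec4SOA` struct and the `Dot4` function are illustrative names introduced here, not part of the talk; the layout simply stores one component of four different vectors per array.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Hypothetical SOA bundle: four vectors, one array per component.
struct Vec4SOA { float x[4], y[4], z[4], w[4]; };

// Computes out[i] = dot(a_i, b_i) for i = 0..3 without any shuffles.
void Dot4(const Vec4SOA& a, const Vec4SOA& b, float out[4])
{
  __m128 dx = _mm_mul_ps(_mm_loadu_ps(a.x), _mm_loadu_ps(b.x));
  __m128 dy = _mm_mul_ps(_mm_loadu_ps(a.y), _mm_loadu_ps(b.y));
  __m128 dz = _mm_mul_ps(_mm_loadu_ps(a.z), _mm_loadu_ps(b.z));
  __m128 dw = _mm_mul_ps(_mm_loadu_ps(a.w), _mm_loadu_ps(b.w));

  __m128 a0 = _mm_add_ps(dx, dy);
  __m128 a1 = _mm_add_ps(dz, dw);
  _mm_storeu_ps(out, _mm_add_ps(a0, a1));
}
```

Every lane does useful work, and the code reads almost line for line like the scalar formula.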
Don’t waste time on SSE classes 

• Trying to abstract SOA hardware with AOS data 

• Doomed to be awkward & slow 

• SSE code wants to be free! 

• Best performance without wrappers or frameworks 


• Just write small helper routines as needed 
“It doesn’t fit our data layout” 

• void SpawnParticle(float pos[3], ...); 

• Stored in struct Particle { float pos[3]; ... } 

• Awkward to work with array of Particle in SSE 


• So don’t do that 

► Keep the spawn function, change the memory layout 

► The problem is with struct Particle, not SSE 
Struct Particle in memory (AOS) 

[Figure: memory layout, one complete Particle after another] 

Position XYZ | Age | Velocity XYZ | Other Junk | Material Data | Acceleration XYZ 
Position XYZ | Age | Velocity XYZ | Other Junk | Material Data | Acceleration XYZ 
Position XYZ | Age | Velocity XYZ | Other Junk | Material Data | Acceleration XYZ 
Particles in memory (SOA) 

[Figure: memory layout, one attribute at a time across all particles] 

Pos X | Pos X | Pos X | Pos X 
Pos Y | Pos Y | Pos Y | Pos Y 
Pos Z | Pos Z | Pos Z | Pos Z 
Age   | Age   | Age   | Age 
Vel X | Vel X | Vel X | Vel X 
Data Layout Choices 

• SOA form usually much better for SSE code 

• Maps naturally to instruction set 

• SOA SIMD code maps closely to scalar reference code 

• AOS form usually better for scalar problems 

• Especially for lookup or indexing algorithms 

• Single cache miss to get at group of values 

• Generate SOA data locally in transform if needed 

• Trade off SIMD efficiency by shuffling in/out 
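
As a rough illustration of "generate SOA data locally", the sketch below gathers positions out of an AOS particle array into per-component arrays that a SIMD kernel could then stream through. The `Particle` fields beyond `pos`, the `PositionsSOA` struct, and `GatherPositions` are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS element; everything except pos is a placeholder for "other data".
struct Particle {
  float pos[3];
  float age;
  float vel[3];
};

// SOA view built locally for the duration of one transform.
struct PositionsSOA {
  std::vector<float> x, y, z;
};

// Copies the positions into one array per component (the "shuffle in" cost).
PositionsSOA GatherPositions(const Particle* particles, std::size_t count)
{
  PositionsSOA out;
  out.x.resize(count);
  out.y.resize(count);
  out.z.resize(count);
  for (std::size_t i = 0; i < count; ++i) {
    out.x[i] = particles[i].pos[0];
    out.y[i] = particles[i].pos[1];
    out.z[i] = particles[i].pos[2];
  }
  return out;
}
```

The gather is scalar and touches the whole AOS array, which is the efficiency being traded away; it pays off when the SIMD work done on the SOA copy is large compared to the copy itself.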
Case Study: Doors 

• Doors to open themselves automatically 

• When actor of right “allegiance” is within some radius 

• Think Star Trek doors 

• Typical game problem 

• Initially implemented as an OO solution 

• Started to pop on the performance radar 

• ~100 doors x ~30 characters to test against = 3,000 tests! 
Original Door Update 

void Door::Update(float dt) 
{ 
  ActorList all_characters = GetAllCharacters(); 

  bool should_open = false; 

  for (Actor* actor : all_characters) { 
    if (AllegianceComponent* c = actor->FindComponent<AllegianceComponent>()) { 
      if (c->GetAllegiance() == m_Allegiance) { 
        if (VecDistanceSquared(actor->GetPosition(), this->GetPosition()) < m_OpenDistanceSq) { 
          should_open = true; 
          break; 
        } 
      } 
    } 
  } 

  ... 
} 

Problems called out on this code: 

• Scalar by definition 

• Repeated work 

• Multiple L2 misses 

• Not using all data 

• Scalar compute 
Input Data in Original Update 

[Figure: data touched per test. Door Actor (m_Position), Hash Table, Door Component (m_Allegiance); Char Actor (m_Position), Hash Table, Component. The two allegiance values actually needed are labeled "1 byte" each.] 
What is it actually computing? 

for each door: 
  door.should_be_open = 0 
  for each character: 
    if InRadius(...) and door.team == char.team: 
      door.should_be_open = 1 
What’s the radius test computing? 

• Tests if a point is within a sphere 

• Inputs: Two points x0,y0,z0 and x1,y1,z1 + a squared radius 

• Outputs: yes or no 

• (It’s just trying to avoid a square root) 

$(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2 \le r^2$ 

• OK, now we understand all the pieces 
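
As a scalar reference for the radius test, a minimal sketch might look like this. The function name `InRadius` follows the pseudocode earlier in the talk; the parameter list is an assumption made here.

```cpp
#include <cassert>

// Returns true when (x1, y1, z1) lies within the sphere centered at
// (x0, y0, z0) whose squared radius is r_sq. Comparing squared values
// avoids the square root.
bool InRadius(float x0, float y0, float z0,
              float x1, float y1, float z1, float r_sq)
{
  float dx = x0 - x1;
  float dy = y0 - y1;
  float dz = z0 - z1;
  return dx * dx + dy * dy + dz * dz <= r_sq;
}
```

Having a scalar version like this around is handy for checking the SIMD code against.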
SIMD Prep Work 

• Move door data to central place 

• Really just a bag of values in SOA form 

• Good approach as doors are rarely created & destroyed 

• Each door has an index into central data stash 


• Build actor tables locally in update 

• Once per update, not 100 times 

• Stash in simple array on stack (alloca for variable size) 
Door Update Data Design 

// In memory, SOA 
struct DoorData { 
  uint32_t  Count; 
  float    *X; 
  float    *Y; 
  float    *Z; 
  float    *RadiusSq; 
  uint32_t *Allegiance; 
  // Output data 
  uint32_t *ShouldBeOpen; 
} s_Doors; 

// On the stack, AOS 
struct CharData { 
  float    X; 
  float    Y; 
  float    Z; 
  uint32_t Allegiance; 
} c[MAXCHARS]; 
SIMD Door Update 

• New update does all doors in one go 

• Test 4 doors vs 1 actor in inner loop 


• Massive benefits from the data layout 

► All compute naturally falls out as SIMD operations 
Outer Loop Prologue 

for (int d = 0; d < door_count; d += 4) { 
  __m128  dx = _mm_load_ps(&s_Doors.X[d]); 
  __m128  dy = _mm_load_ps(&s_Doors.Y[d]); 
  __m128  dz = _mm_load_ps(&s_Doors.Z[d]); 
  __m128  dr = _mm_load_ps(&s_Doors.RadiusSq[d]); 
  __m128i da = _mm_load_si128((__m128i*) &s_Doors.Allegiance[d]); 

  __m128i state = _mm_setzero_si128(); 

Load attributes for 4 doors, clear 4 “open” accumulators 
Inner Loop Prologue 

  ... 

  for (int cc = 0; cc < char_count; ++cc) { 
    __m128  char_x = _mm_broadcast_ss(&c[cc].X); 
    __m128  char_y = _mm_broadcast_ss(&c[cc].Y); 
    __m128  char_z = _mm_broadcast_ss(&c[cc].Z); 
    __m128i char_a = _mm_set1_epi32(c[cc].Allegiance); 

Load attributes for 1 character, broadcast to all 4 lanes 
Inner Loop Math 

    ... 

    __m128 ddx = _mm_sub_ps(dx, char_x); 
    __m128 ddy = _mm_sub_ps(dy, char_y); 
    __m128 ddz = _mm_sub_ps(dz, char_z); 
    __m128 dtx = _mm_mul_ps(ddx, ddx); 
    __m128 dty = _mm_mul_ps(ddy, ddy); 
    __m128 dtz = _mm_mul_ps(ddz, ddz); 
    __m128 dst = _mm_add_ps(_mm_add_ps(dtx, dty), dtz); 

Compute squared distance between character & the 4 doors 
Inner Loop Epilogue 

    __m128  rmask = _mm_cmple_ps(dst, dr); 
    __m128i amask = _mm_cmpeq_epi32(da, char_a); 
    __m128i mask  = _mm_and_si128(amask, _mm_castps_si128(rmask)); 

    state = _mm_or_si128(mask, state); 
  } 

Compare against door open radii AND allegiance => OR into state 
Outer Loop Epilogue 

  _mm_store_si128((__m128i*) &s_Doors.ShouldBeOpen[d], state); 
} 

Store “should open” for these 4 doors, ready for next group of 4 
for (int d = 0; d < door_count; d += 4) { 
  __m128  dx    = _mm_load_ps(&s_Doors.X[d]); 
  __m128  dy    = _mm_load_ps(&s_Doors.Y[d]); 
  __m128  dz    = _mm_load_ps(&s_Doors.Z[d]); 
  __m128  dr    = _mm_load_ps(&s_Doors.RadiusSq[d]); 
  __m128i da    = _mm_load_si128((__m128i*) &s_Doors.Allegiance[d]); 
  __m128i state = _mm_setzero_si128(); 

  for (int cc = 0; cc < char_count; ++cc) { 
    __m128  char_x = _mm_broadcast_ss(&c[cc].X); 
    __m128  char_y = _mm_broadcast_ss(&c[cc].Y); 
    __m128  char_z = _mm_broadcast_ss(&c[cc].Z); 
    __m128i char_a = _mm_set1_epi32(c[cc].Allegiance); 

    __m128  ddx    = _mm_sub_ps(dx, char_x); 
    __m128  ddy    = _mm_sub_ps(dy, char_y); 
    __m128  ddz    = _mm_sub_ps(dz, char_z); 
    __m128  dtx    = _mm_mul_ps(ddx, ddx); 
    __m128  dty    = _mm_mul_ps(ddy, ddy); 
    __m128  dtz    = _mm_mul_ps(ddz, ddz); 
    __m128  dst    = _mm_add_ps(_mm_add_ps(dtx, dty), dtz); 

    __m128  rmask  = _mm_cmple_ps(dst, dr); 
    __m128i amask  = _mm_cmpeq_epi32(da, char_a); 
    __m128i mask   = _mm_and_si128(amask, _mm_castps_si128(rmask)); 

    state = _mm_or_si128(mask, state); 
  } 

  _mm_store_si128((__m128i*) &s_Doors.ShouldBeOpen[d], state); 
} 
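
For readers who want to try the whole update, here is a self-contained sketch under a few stated assumptions: it uses SSE2 only, so `_mm_set1_ps` stands in for the AVX `_mm_broadcast_ss`; unaligned loads and stores replace the aligned ones so plain arrays work; and the door count is assumed to be padded to a multiple of 4. `UpdateDoors` and its parameter list are illustrative names, not the studio's code.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

struct DoorData {          // SOA, as in the data design slide
  uint32_t  Count;         // assumed padded to a multiple of 4
  float    *X, *Y, *Z, *RadiusSq;
  uint32_t *Allegiance;
  uint32_t *ShouldBeOpen;  // output: 0 or 0xFFFFFFFF per door
};

struct CharData { float X, Y, Z; uint32_t Allegiance; };  // AOS

void UpdateDoors(const DoorData& doors, const CharData* c, int char_count)
{
  for (uint32_t d = 0; d < doors.Count; d += 4) {
    // Load attributes for 4 doors, clear 4 "open" accumulators.
    __m128  dx = _mm_loadu_ps(&doors.X[d]);
    __m128  dy = _mm_loadu_ps(&doors.Y[d]);
    __m128  dz = _mm_loadu_ps(&doors.Z[d]);
    __m128  dr = _mm_loadu_ps(&doors.RadiusSq[d]);
    __m128i da = _mm_loadu_si128((const __m128i*) &doors.Allegiance[d]);
    __m128i state = _mm_setzero_si128();

    for (int cc = 0; cc < char_count; ++cc) {
      // Broadcast one character to all 4 lanes.
      __m128  char_x = _mm_set1_ps(c[cc].X);
      __m128  char_y = _mm_set1_ps(c[cc].Y);
      __m128  char_z = _mm_set1_ps(c[cc].Z);
      __m128i char_a = _mm_set1_epi32((int) c[cc].Allegiance);

      // Squared distance between the character and the 4 doors.
      __m128 ddx = _mm_sub_ps(dx, char_x);
      __m128 ddy = _mm_sub_ps(dy, char_y);
      __m128 ddz = _mm_sub_ps(dz, char_z);
      __m128 dst = _mm_add_ps(_mm_add_ps(_mm_mul_ps(ddx, ddx),
                                         _mm_mul_ps(ddy, ddy)),
                              _mm_mul_ps(ddz, ddz));

      // In radius AND same allegiance => OR into state.
      __m128  rmask = _mm_cmple_ps(dst, dr);
      __m128i amask = _mm_cmpeq_epi32(da, char_a);
      state = _mm_or_si128(state, _mm_and_si128(amask, _mm_castps_si128(rmask)));
    }

    _mm_storeu_si128((__m128i*) &doors.ShouldBeOpen[d], state);
  }
}
```

Note that the output is a lane mask (all ones or zero) rather than a `bool`; consumers only need to test it against zero.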
Inner Loop Code Generation 

vbroadcastss xmm6, dword ptr [rcx-8] 
vbroadcastss xmm7, dword ptr [rcx-4] 
vbroadcastss xmm1, dword ptr [rcx] 
vbroadcastss xmm2, dword ptr [rcx+4] 
vsubps       xmm6, xmm8, xmm6 
vsubps       xmm7, xmm9, xmm7 
vsubps       xmm1, xmm3, xmm1 
vmulps       xmm6, xmm6, xmm6 
vmulps       xmm7, xmm7, xmm7 
vmulps       xmm1, xmm1, xmm1 
vaddps       xmm6, xmm6, xmm7 
vaddps       xmm1, xmm6, xmm1 
vcmpps       xmm1, xmm1, xmm4, 2 
vpcmpeqd     xmm2, xmm5, xmm2 
vpand        xmm1, xmm2, xmm1 
vpor         xmm0, xmm1, xmm0 
add          rcx, 10h 
dec          edi 
jnz          .Loop 

~6 cycles per 4 door/actor tests 
100 doors x 30 actors = ~4500 cycles 
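
The cycle estimate follows from the loop structure: the inner loop body runs once per (group of 4 doors, actor) pair. Taking the roughly 6 cycles per iteration quoted on the slide as given:

```latex
\frac{100\ \text{doors}}{4\ \text{doors per iteration}} \times 30\ \text{actors} \times 6\ \text{cycles per iteration} = 4500\ \text{cycles}
```

This counts only the inner loop; the per-group loads and stores in the outer loop add a small amount on top.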
Door Results 

• 20-100x speedup 

• Even more possible now that we can reason about the data 

• Brute force SIMD for “reasonable # of things” 

• Lots of “reasonable # of things” problems in a game! 

• Removing cache misses + SIMD ALU can be huge win 

• Solves “death by a thousand cuts” problems 

• This type of transform typically takes it off the radar 
Techniques & Tricks 

• Need to cope with messy data & constraints 

• Want largest possible scope for SIMD throughout codebase 

• We’ll look at two tricks 

• Left packing 

• Dynamic mask generation 

• SSSE3+ greatly expands the trick repertoire 

• We'll start with SSSE3+ features 

• We'll come back to SSE2 (the dark ages) after that.. 
Problem: Filtering Data 

• Discarding data while streaming 

• Not a 1:1 relationship between input and output 

• N inputs, M outputs, M <= N 

• Not writing multiple of SIMD register width to output! 

• Want to express as SIMD kernel, but how? 
Scalar Filtering 

int FilterFloats_Reference(const float input[], float output[], 
                           int count, float limit) 
{ 
  float *outputp = output; 

  for (int i = 0; i < count; ++i) { 
    if (input[i] >= limit) 
      *outputp++ = input[i]; 
  } 

  return (int) (outputp - output); 
} 
SIMD Filtering Skeleton.. 

for (int i = 0; i < count; i += 4) { 
  // Load 4 floats 
  __m128 val = _mm_load_ps(input + i); 

  // Perform 4 compares => mask 
  __m128 mask = _mm_cmpge_ps(val, _mm_set1_ps(limit)); 

  // Left-pack valid elements to front of register 
  __m128 result = LeftPack(mask, val); 

  // Store unaligned to current output position 
  _mm_storeu_ps(output, result); 

  // Advance output position based on mask 
  output += _popcnt(_mm_movemask_ps(mask)); 
} 
Left Packing Problem (4-wide, limit=0) 

[Figure: animated example with rows labeled Input, Mask, Left Pack and Output. Four input values at a time are compared against the limit to form a mask; the values that pass are packed to the left and appended to the output, and the remaining lanes are marked "Don’t Care".] 
Left Packing (SSSE3+) 

• _mm_movemask_ps() = bit mask of valid lanes 

• Value in the range 0-15 

• Leverage indirect shuffle via PSHUFB 

• a.k.a. _mm_shuffle_epi8() 

• Lookup table of 16 shuffles (4-wide case) 

• Each shuffle moves valid lanes to the left 

• Need 16 x 16 = 256 bytes (4 cache lines) of LUT data 
Left Packing Code (SSSE3+) 

__m128i LeftPack_SSSE3(__m128 mask, __m128 val) 
{ 
  // Move 4 sign bits of mask to 4-bit integer value. 
  int mask_bits = _mm_movemask_ps(mask); 

  // Select shuffle control data 
  __m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask_bits]); 

  // Permute to move valid values to front of SIMD register 
  __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl); 

  return packed; 
} 

[Figure: example shuffle control data, including the bytes 80 80 80 80 / 80 80 80 80] 
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Problem: Dynamic Low Masking 

• Want mask that isolates n lower bits per lane 

• Useful for dynamic fixed point & many other things 

• Easy in scalar code: (1 << n) - 1

• No straightforward SSE equivalent

• n varies across SIMD register 

• No instruction to do variable shifts per lane 
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LUT Low Mask Generation, n = 17 

Bit 31 ... Bit 0

LUT (index 0..8): 00 01 03 07 0F 1F 3F 7F FF
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LUT Low Mask Generation, n = 17 


Bit 31 ... Bit 0
Mask bytes: 00 01 FF FF

Index 0: Clamp(n, 0, 8) = 8
Index 1: Clamp(n-8, 0, 8) = 8
Index 2: Clamp(n-16, 0, 8) = 1
Index 3: Clamp(n-24, 0, 8) = 0
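
The per-byte clamp and lookup above can be modelled in scalar C. This is only a sketch to make the arithmetic concrete, not the SIMD routine itself; the names `kLowMaskLut`, `clamp_int` and `mask_low_bits_lut` are made up for illustration.

```c
#include <stdint.h>

/* Byte masks with 0..8 low bits set, as listed in the LUT on the slide. */
static const uint8_t kLowMaskLut[9] = {
    0x00, 0x01, 0x03, 0x07, 0x0F, 0x1F, 0x3F, 0x7F, 0xFF
};

/* Clamp v to the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Scalar model: build the mask with the n lowest bits set, one byte at a time. */
static uint32_t mask_low_bits_lut(int n)
{
    uint32_t result = 0;
    for (int byte = 0; byte < 4; ++byte) {
        /* Byte k of the result needs Clamp(n - 8k, 0, 8) low bits set. */
        int index = clamp_int(n - 8 * byte, 0, 8);
        result |= (uint32_t)kLowMaskLut[index] << (8 * byte);
    }
    return result;
}
```

For n = 17 the four indices are 8, 8, 1, 0, which gives 0x0001FFFF.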
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Dynamic Low Masking (SSSE3+) 

• PSHUFB can be used as nibble->byte lookup 

• 16 parallel lookups 

• Works for low masking because we only have 9 cases 

• Index clamping 

• Use saturated addition & subtraction

• Compute all 16 byte lookup indices in parallel 
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Dynamic Low Masking in Action 

Byte values are shown most significant byte first within each 32-bit lane.

Lane byte offset               0             4             8             12
Inputs (n)                     0x09          0x20          0x11          0x02
Replicate low byte (PSHUFB)    09 09 09 09   20 20 20 20   11 11 11 11   02 02 02 02
(Constant CEIL)                DF E7 EF F7   DF E7 EF F7   DF E7 EF F7   DF E7 EF F7
Saturating add CEIL            E8 F0 F8 FF   FF FF FF FF   F0 F8 FF FF   E1 E9 F1 F9
(Constant FLOOR)               F7 F7 F7 F7   F7 F7 F7 F7   F7 F7 F7 F7   F7 F7 F7 F7
Saturating subtract FLOOR      00 00 01 08   08 08 08 08   00 01 08 08   00 00 00 02
Table lookup (PSHUFB LUT)      0x000001FF    0xFFFFFFFF    0x0001FFFF    0x00000003

(Constant LUT) entries 0..8: 00 01 03 07 0F 1F 3F 7F FF
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Dynamic Low Masking Routine (SSSE3) 


__m128i MaskLowBits_SSSE3(__m128i n)
{
  __m128i ii = _mm_shuffle_epi8(n, BYTES);
  __m128i si = _mm_adds_epu8(ii, CEIL);
  si = _mm_subs_epu8(si, FLOOR);
  return _mm_shuffle_epi8(LUT, si);
}
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SSE2 Techniques 

• SSE2 is ancient, but fine for basic SOA SIMD 

• Massive speedups still possible 

• Sometimes basic SSE2 will beat SSE4.1 on same HW 

• Emulation in terms of “simpler” instructions can be faster 

• Trickier to do “unusual” things with SSE2 

• Only fixed shuffles 

• Integer support lackluster 



GAME DEVELOPERS CONFERENCE® 2015 


MARCH 2-6, 2015 GDCONF.COM 


SSE2 Left Packing: Move Distances 


• No dynamic shuffles

• Need divide & conquer algorithm

• How far does each lane have to travel?

Mask   Output   Move Distances (X Y Z W)
0000   ....     0 0 0 0
0001   X...     0 0 0 0
0010   Y...     0 1 0 0
0011   XY..     0 0 0 0
0100   Z...     0 0 2 0
0101   XZ..     0 0 1 0
0110   YZ..     0 1 1 0
0111   XYZ.     0 0 0 0
1000   W...     0 0 0 3
1001   XW..     0 0 0 2
1010   YW..     0 1 0 2
1011   XYW.     0 0 0 1
1100   ZW..     0 0 2 2
1101   XZW.     0 0 1 1
1110   YZW.     0 1 1 1
1111   XYZW     0 0 0 0
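
The move distances in the table follow a simple rule: a valid lane moves left by the number of invalid lanes below it. A small scalar sketch of that rule, with the hypothetical helper name `move_distances` and lane order X=0, Y=1, Z=2, W=3 assumed:

```c
#include <stdint.h>

/* Compute, for each of the 4 lanes, how many lanes to the left it must move.
   Bit i of mask set means lane i is valid. Invalid lanes get distance 0. */
static void move_distances(unsigned mask, uint8_t dist[4])
{
    unsigned holes = 0; /* invalid lanes seen so far */
    for (int lane = 0; lane < 4; ++lane) {
        if (mask & (1u << lane)) {
            dist[lane] = (uint8_t)holes;
        } else {
            dist[lane] = 0;
            ++holes;
        }
    }
}
```

For mask 1010 (YW..) this yields distances 0 1 0 2, matching the table row.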
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Left Packing with Move Distances 

• Process move distances (MD) bit by bit 

• Rotate left by 1 - Select based on Bit 0 of MD 

• Rotate left by 2 - Select based on Bit 1 of MD 

• And so on.. 

• Generalizes to wider registers & more elements 

• log2(n) rotates + selects required

• For example 16-bit left pack, or 8x AVX float left pack 

• 2 for 4-wide case, 3 for 8-wide case, ... 
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Left-packing YW.. (simplified)

[Figure: Input (X Y Z W) → Rot 1 → Select (bit 0 of move distances) → Rot 2 → Select (bit 1 of move distances)]

• Store selection masks in LUT
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SSE2 Left Packing Code (4-wide) 


__m128 PackLeft_SSE2(__m128 mask, __m128 val)
{
  // Grab mask of valid elements
  int valid = _mm_movemask_ps(mask);

  // Load precomputed selection masks from LUT
  __m128 mask0 = _mm_load_ps((float *)(&g_Masks[valid][0]));
  __m128 mask1 = _mm_load_ps((float *)(&g_Masks[valid][4]));

  // First round of rotate+select
  __m128 s0 = _mm_shuffle_ps(val, val, _MM_SHUFFLE(0, 3, 2, 1));
  __m128 r0 = _mm_or_ps(_mm_and_ps(mask0, s0), _mm_andnot_ps(mask0, val));

  // Second round of rotate+select
  __m128 s1 = _mm_shuffle_ps(r0, r0, _MM_SHUFFLE(1, 0, 3, 2));
  __m128 r1 = _mm_or_ps(_mm_and_ps(mask1, s1), _mm_andnot_ps(mask1, r0));

  return r1;
}
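
The slides do not show how the selection-mask LUT is filled. One possible way, sketched below, is to simulate both rotate+select rounds for each of the 16 masks. The layout of `g_Masks` (4 words for round 0, then 4 words for round 1) is inferred from the indexing in the code above; `InitPackLeftMasks` and `PackLeft_SSE2_sketch` are hypothetical names, and this is not necessarily how the original table was built.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Assumed layout: per 4-bit mask, words 0..3 select for round 0, words 4..7 for round 1. */
#if defined(_MSC_VER)
__declspec(align(16)) static uint32_t g_Masks[16][8];
#else
static uint32_t g_Masks[16][8] __attribute__((aligned(16)));
#endif

/* Fill g_Masks by simulating both rotate+select rounds for every mask. */
static void InitPackLeftMasks(void)
{
    for (unsigned m = 0; m < 16; ++m) {
        int dist[4]; /* remaining move distance of the element in each lane */
        int holes = 0;
        for (int i = 0; i < 4; ++i) {
            if (m & (1u << i)) { dist[i] = holes; }
            else { dist[i] = 0; ++holes; }
        }
        for (int round = 0; round < 2; ++round) {
            int step = 1 << round;
            int next[4];
            for (int i = 0; i < 4; ++i) {
                int src = i + step;
                /* Lane i takes the rotated value if the element at src must move by step. */
                int take = (src < 4) && (dist[src] & step);
                int moved_away = (dist[i] & step) != 0;
                g_Masks[m][round * 4 + i] = take ? 0xFFFFFFFFu : 0u;
                next[i] = take ? dist[src] : (moved_away ? 0 : dist[i]);
            }
            for (int i = 0; i < 4; ++i) { dist[i] = next[i]; }
        }
    }
}

/* Same two rounds as the slide code, using the table built above. */
static __m128 PackLeft_SSE2_sketch(__m128 mask, __m128 val)
{
    int valid = _mm_movemask_ps(mask);
    __m128 mask0 = _mm_load_ps((const float *)&g_Masks[valid][0]);
    __m128 mask1 = _mm_load_ps((const float *)&g_Masks[valid][4]);
    __m128 s0 = _mm_shuffle_ps(val, val, _MM_SHUFFLE(0, 3, 2, 1));
    __m128 r0 = _mm_or_ps(_mm_and_ps(mask0, s0), _mm_andnot_ps(mask0, val));
    __m128 s1 = _mm_shuffle_ps(r0, r0, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_or_ps(_mm_and_ps(mask1, s1), _mm_andnot_ps(mask1, r0));
}
```

Only the leading lanes (as many as there are valid inputs) are meaningful in the result; the remaining lanes hold leftovers.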
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Dynamic Low Masking in SSE2 

• Recall IEEE floating point format 

• sign * 2^exponent * mantissa

• That exponent sure looks like a shifter..

• 2^n = 1 << n

• Idea: 

• Craft special float by populating exponent with biased n 

• Convert to integer, then subtract 1 
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Overflow woes 

• Conversion from float to int is signed 

• When n >= 31, can't fit in signed integer 

• INT_MAX = 0x7fffffff 

• Overflow is clamped to “integer indeterminate” 

• Which happens to be.. 0x80000000 

• Exactly what we need for n=31 

• n > 31 will clamp to 31 
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Dynamic Low Masking in SSE2 

__m128i MaskLowBits_SSE2(__m128i n)
{
  SSE_CONSTANT_4(c_1, uint32_t, 1);
  SSE_CONSTANT_4(c_127, uint32_t, 127);

  // Add 127 to generate biased exponent
  __m128i exp = _mm_add_epi32(n, c_127);

  // Move exponent into place to make it pass as a float
  __m128i fltv = _mm_slli_epi32(exp, 23);

  // Convert the float to an int yielding 2^n as an integer
  __m128i intv = _mm_cvtps_epi32(_mm_castsi128_ps(fltv));

  // Subtract one to generate mask
  return _mm_sub_epi32(intv, c_1);
}
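
A self-contained variant of the routine, for experimenting outside the original codebase. `SSE_CONSTANT_4` is the author's macro and is not shown in the slides, so this sketch substitutes `_mm_set1_epi32`; the function name `MaskLowBits_SSE2_sketch` is made up, and each lane of n is assumed to be in the range 0..31.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Per lane: returns (1 << n) - 1 for n in 0..30, and 0x7FFFFFFF for n = 31. */
static __m128i MaskLowBits_SSE2_sketch(__m128i n)
{
    const __m128i c_1 = _mm_set1_epi32(1);
    const __m128i c_127 = _mm_set1_epi32(127);

    /* Biased exponent of the float 2^n. */
    __m128i biased = _mm_add_epi32(n, c_127);

    /* Shift into the exponent field; sign and mantissa bits stay zero. */
    __m128i fltv = _mm_slli_epi32(biased, 23);

    /* Float to int conversion yields 2^n; 2^31 overflows to 0x80000000. */
    __m128i intv = _mm_cvtps_epi32(_mm_castsi128_ps(fltv));

    return _mm_sub_epi32(intv, c_1);
}
```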




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Best Practices: Branching 

• Guideline: Avoid branches in general 

• Mispredicted branch still extremely costly on most H/W 

• Don't want hard-to-predict branches in inner loops 

• Can be OK to branch if very predictable 

• Branch should be predicted correctly 99+% to make sense 

• E.g. a handful of expensive things in a sea of data 

• Use _mm_movemask_X()+ if on SSE2 

• Consider: _mm_testz_si128() and friends on SSE4.1+
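
As an illustration of the "handful of expensive things in a sea of data" case, here is a minimal SSE2 sketch that tests four lanes at once with `_mm_movemask_ps` and only branches into a slow path when some lane needs it. The function names, the outlier criterion and the slow path itself are made up for the example; count is assumed to be a multiple of 4.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Hypothetical slow path for a rare element; here it just clamps to the limit. */
static float handle_rare(float v, float limit)
{
    (void)v;
    return limit;
}

/* Clamp rare outliers in place and return how many elements took the slow path.
   Assumes values above limit are rare, so the branch is almost never taken. */
static size_t clamp_outliers(float *data, size_t count, float limit)
{
    const __m128 vlimit = _mm_set1_ps(limit);
    size_t slow = 0;
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        /* One bit per lane: set where the lane exceeds the limit. */
        int rare = _mm_movemask_ps(_mm_cmpgt_ps(v, vlimit));
        if (rare) {
            for (int lane = 0; lane < 4; ++lane) {
                if (rare & (1 << lane)) {
                    data[i + lane] = handle_rare(data[i + lane], limit);
                    ++slow;
                }
            }
        }
    }
    return slow;
}
```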
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Alternatives to Branching 

• GPU-style “compute both branches” + select 

• Works fine for many smaller problems 

• Start here for small branches 

• Separate input data + kernel per problem 

• Yields best performance when possible 

• Consider partitioning index sets 

• Run fast kernel to partition index data into multiple sets 

• Run optimized kernel on each subset 

• Prefetching can be useful unless most indices are visited 
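
A minimal SSE2 sketch of the "compute both branches + select" approach. The helper names `select_ps` and `scale_by_sign` and the operation itself are illustrative only; on SSE4.1 and later a blend instruction could replace the and/andnot/or sequence.

```c
#include <emmintrin.h>

/* SSE2 select: lanes where mask is all ones take a, all other lanes take b. */
static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

/* Branch-free form of: x < 0 ? x * neg_scale : x * pos_scale, for 4 lanes. */
static __m128 scale_by_sign(__m128 x, float neg_scale, float pos_scale)
{
    __m128 is_neg = _mm_cmplt_ps(x, _mm_setzero_ps());
    __m128 neg_result = _mm_mul_ps(x, _mm_set1_ps(neg_scale)); /* "then" side */
    __m128 pos_result = _mm_mul_ps(x, _mm_set1_ps(pos_scale)); /* "else" side */
    return select_ps(is_neg, neg_result, pos_result);
}
```

Both sides are always computed, so this only pays off when each side is cheap.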
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Best Practices: Prefetching 

• Absolutely necessary on previous generation 

• Not a good idea to carry this forward blindly to x86 

• Guideline: Don’t prefetch linear array accesses 

• Can carry a heavy TLB miss cost on some H/W

• The chip is already prefetching at the cache level for free 

• Guideline: Maybe prefetch upcoming ptrs/indices 

• IF: you know they will be far enough apart/irregular 

• Prefetch instructions vary somewhat between AMD/Intel 

• Test carefully that you’re getting benefit on all H/W 
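
A small sketch of prefetching through an index array, the irregular-access case the guideline refers to. The function name and the prefetch distance of 8 are arbitrary assumptions; any real distance would have to be tuned and measured per platform, as the slide says.

```c
#include <xmmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed look-ahead distance in elements; purely illustrative. */
enum { kPrefetchDistance = 8 };

/* Sum values[indices[i]] while prefetching the element needed a few iterations ahead. */
static float sum_indexed(const float *values, const uint32_t *indices, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        if (i + kPrefetchDistance < count) {
            /* The index array itself is read linearly, so only the target is prefetched. */
            _mm_prefetch((const char *)&values[indices[i + kPrefetchDistance]], _MM_HINT_T0);
        }
        sum += values[indices[i]];
    }
    return sum;
}
```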
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Best Practices: Unrolling 

• Common in VMX128/SPU style code

• Made a lot of sense with in-order machines to hide latency 

• Also had lots of registers! 

• Generally not a good idea for SSE/AVX 

• Only 16 (named) registers - H/W has many more internally 

• Out of order execution unrolls for you to some extent 

• Guideline: Unroll only up to full register width 

• E.g. unroll 2x 64-bit loop to get 128 bit loop, but no more 

• Can make exceptions for very small loops as needed 
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Best Practices: Streaming R/W 

• Do use streaming reads (SSE 4.1+) and writes

• Helps avoid cache thrashing

• Especially for kernels using large lookup tables 

• But don’t forget to fence!! 

• Different options for different architectures 

• _mm_mfence() always works but is slow

• Streaming sidesteps strong x86 memory model 

• Subtle data races will happen if you don't fence 
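
A minimal sketch of a streaming write followed by a fence. It uses `_mm_sfence()`, which orders the non-temporal stores against later stores; that is assumed to be enough for this producer-style example, with `_mm_mfence()` as the conservative fallback the slide mentions. The function name is made up, dst is assumed 16-byte aligned and count a multiple of 4.

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Fill dst with value using non-temporal stores that bypass the cache. */
static void fill_streaming(float *dst, size_t count, float value)
{
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);
    }
    /* Make the streamed data globally visible before any later stores,
       for example a "data is ready" flag read by another thread. */
    _mm_sfence();
}
```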
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Conclusion: It’s not magic 

• You too can be a performance hero! 

• Small investments can yield substantial benefits 

• Modern SSE is not bad at all

• Not a lot of best practices out there 

• Hopefully this talk gives you something to start with! 
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SSE and AVX Resources 

• ISPC - http://ispc.github.io 

• Intel Intrinsics Guide

• https://software.intel.com/sites/landingpage/IntrinsicsGuide

• Available as Dash DocSet for Mac OS X by yours truly 

• Intel Architecture Code Analyzer 

• https://software.intel.com/en-us/articles/intel-architecture-code-analyzer 

• Agner Fog’s instruction timings 

• http://www.agner.org/optimize/#manuals
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Q&A 

Twitter: @deplinenoise 

Email: afredriksson@insomniacgames.com 

Special thanks: 

Fabian Giesen 
Mike Day 
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SSE in 2015: Where are we? 

• SSE on x64 with modern feature set is not bad 

• Has a lot of niceties, especially in SSE 4.1 and later 

• Support heavily fragmented on PC consumer machines 

• Limitation #1: Only 16 registers (x64) 

• Easy to blow with a single splat’d 4x4 matrix! 

• Carefully check generated assembly from intrinsics 

• Limitation #2: No dynamic two-register shuffles 

• Challenge when porting Altivec/SPU style code 
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SSE Goodies since SSE2 


Technology   Goodies
SSSE3        PSHUFB, Integer Abs
SSE4.1       32-bit low mul, Blend, Integer Min+Max, Insert + Extract, PTEST, PACKUSDW, ...
SSE4.2       POPCNT (only)

(POPCNT has its own CPUID flag)
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SSE Fragmentation Nov 2014 

Data kindly provided by Unity 


Technology   Web Player   Unity Editor   Year Introduced
SSE2         100%         100%           2001
SSE3         100%         100%           2004
SSSE3        75%          93%            2006
SSE4.1       51%          83%            2007
SSE4.2       44%          77%            2008
AVX          23%          61%            2011
AVX2         4%           19%            2013
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So.. What about AVX? 

• Great when supported on Intel chips! 

• 2x gain for compute bound problems 

• Can easily become memory bound for simpler problems! 

• Low availability in PC consumer space 

• Crippled on AMD micro architectures 

• Splits to 2 x 128 bit ALU internally (high latency) 

• Not worth it for us, except for some PC tools 
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Cross-platform SSE in practice

• Full SSE4+ with all bells and whistles on consoles 

• Blend, population count, half<->float, ... 

• VEX prefix encoding = free performance 

• Still need SSE2/3 compatibility for PC builds 

• Tools (and games) running on older PCs 

• Dedicated cloud servers with ancient SSE support 

• Straightforward to emulate most SSE4+ insns 

• Establish wrappers early for cross-platform projects 
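
As one example of such a wrapper, the sketch below provides a signed 32-bit minimum that uses the SSE4.1 instruction when the compiler targets it and falls back to an SSE2 compare + select otherwise. The wrapper name `simd_min_epi32` and the use of the `__SSE4_1__` macro (defined by GCC and Clang) for compile-time dispatch are assumptions, not taken from the talk.

```c
#include <emmintrin.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Per-lane signed 32-bit minimum with an SSE2 fallback. */
static __m128i simd_min_epi32(__m128i a, __m128i b)
{
#if defined(__SSE4_1__)
    return _mm_min_epi32(a, b);
#else
    /* Lanes where a > b take b, the rest take a. */
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);
    return _mm_or_si128(_mm_and_si128(a_gt_b, b), _mm_andnot_si128(a_gt_b, a));
#endif
}
```

Call sites use only the wrapper, so the same kernel source builds for both the console and the older-PC targets.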
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Data Layout Recap 

• Two basic choices 

• AOS - Array of Structures 

• SOA - Structure of Arrays

• Hybrid layouts possible 

• Most scalar code tends to be AOS 

• C++ structs and classes make that design choice implicitly 

• Clashes with desire to use SIMD instructions 

• This is probably 75% of the work to fix/compensate for 
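
To make the two layouts concrete, here is a sketch with made-up particle-like fields. The AOS struct interleaves X and Y with unrelated data, while the SOA form keeps each field contiguous so four elements load straight into one register. All names are hypothetical, and count is assumed to be a multiple of 4.

```c
#include <emmintrin.h>
#include <stddef.h>

/* AOS: one struct per element, X and Y sit between unrelated fields. */
struct ParticleAOS {
    int id;
    float x;
    float y;
    int flags;
};

/* SOA: one contiguous array per field. */
struct ParticlesSOA {
    float *x;
    float *y;
    size_t count;
};

/* x[i] += y[i] * scale, four elements per iteration. */
static void soa_add_scaled(struct ParticlesSOA *p, float scale)
{
    const __m128 s = _mm_set1_ps(scale);
    for (size_t i = 0; i < p->count; i += 4) {
        __m128 x = _mm_loadu_ps(p->x + i);
        __m128 y = _mm_loadu_ps(p->y + i);
        _mm_storeu_ps(p->x + i, _mm_add_ps(x, _mm_mul_ps(y, s)));
    }
}
```

Doing the same over an array of `struct ParticleAOS` would need a gather or shuffle step per field before any SIMD arithmetic could start.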
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AOS Data 


Elem 0:  Unrelated | X | Y | ... | Unrelated
Elem 1:  Unrelated | X | Y | ... | Unrelated
Elem 2:  Unrelated | X | Y | ... | Unrelated
Elem 3:  Unrelated | X | Y | ... | Unrelated
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SOA Data 



Index:    0 1 2 3 4 5 6 7 8 9 ...
Xs:       X X X X X X X X X X ...
Ys:       Y Y Y Y Y Y Y Y Y Y ...
Zs:       Z Z Z Z Z Z Z Z Z Z ...
Other..:  Unrelated Unrelated Unrelated ...
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Hybrid: Tiled storage (x4) 


Elems 0..3:  Unrelated | X X X X | Y Y Y Y | Z Z Z Z | Unrelated
Elems 4..7:  Unrelated | X X X X | Y Y Y Y | Z Z Z Z | Unrelated







